Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance
Abstract
Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures. For the test of Breiman and Cutler (2008), we investigate the statistical properties and find that the power of the test depends both on the sample size and the number of trees in an undesirable way that nullifies any significance judgments. Moreover, the specification of the null hypothesis of this test is discussed in the context of correlated predictor variables.
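The dependence on the number of trees can be illustrated with a minimal sketch. It assumes the test statistic is a z-score of the form mean per-tree importance divided by its standard error across trees, sd / sqrt(ntree); the abstract does not state this formula, so treat it as an assumption. The per-tree importance values are simulated from a normal distribution with hypothetical parameters (`true_mean`, `spread`) rather than taken from a fitted forest, and the function names are illustrative only. The sketch does not cover the dependence on sample size.

```python
import math
import random
import statistics


def z_score(per_tree_importance):
    """Mean importance divided by its standard error across trees (assumed form)."""
    n_trees = len(per_tree_importance)
    mean = statistics.fmean(per_tree_importance)
    std_err = statistics.stdev(per_tree_importance) / math.sqrt(n_trees)
    return mean / std_err


def simulated_z(n_trees, true_mean=0.1, spread=1.0, seed=0):
    """z-score for simulated per-tree importances (hypothetical parameters)."""
    rng = random.Random(seed)
    values = [rng.gauss(true_mean, spread) for _ in range(n_trees)]
    return z_score(values)


if __name__ == "__main__":
    # The underlying effect is fixed; only the number of trees changes.
    for n_trees in (100, 1000, 10000):
        print(n_trees, round(simulated_z(n_trees), 2))
```

Under this assumed form, the standard error shrinks as trees are added while the mean stays roughly constant, so the statistic tends to grow with sqrt(ntree) for any nonzero mean importance. Since the number of trees is a tuning parameter chosen by the user, this is one way the power of such a test could be inflated arbitrarily.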
Similar resources
Zeileis Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable
Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated pred...
Identification of Statistically Significant Features from Random Forests
Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not easy to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests if variables are significantly useful in combination with others in a forest. We show e...
Inferring statistically significant features from random forests
Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not straightforward to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests if variables are significantly useful in combination with others in a forest...
Comparing Different Modeling Techniques for Predicting Presence-absence of Some Dominant Plant Species in Mountain Rangelands, Mazandaran Province
In applied studies, the investigation of the relationship between a plant species and environmental variables is essential to manage ecological problems and rangeland ecosystems. This research was conducted in summer 2016. The aim of this study was to compare the predictive power of a number of Species Distribution Models (SDMs) and to evaluate the importance of a range of environmental variabl...
ggRandomForests: Exploring Random Forest Survival
Random forest (Breiman 2001a) (RF) is a non-parametric statistical method requiring no distributional assumptions on covariate relation to the response. RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. Random survival forests (RSF) (Ishwaran and Kogalur 2007; Ishwaran, Kogalur, Blackstone, and Lauer 2008) are an...